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Editor: Abstract 

For 1 dimension reduction in 1%, the method of Cauchy random projections multiplies the orig- 
inal data matrix A S R nx£) with a random matrix R G K. £)><fc (k <§; min(n, D)) whose entries 
are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility results, one can 
not hope to recover the pairwise l\ distances in A from B = AR G R raxfc , using linear estimators 
without incurring large errors. However, nonlinear estimators are still useful for certain applications 
in data stream computation, information retrieval, learning, and data mining. 

We propose three types of nonlinear estimators: the bias-corrected sample median estimator, 
the bias-corrected geometric mean estimator, and the bias-corrected maximum likelihood estimator. 
The sample median estimator and the geometric mean estimator are asymptotically (as k — > oo) 
equivalent but the latter is more accurate at small k. We derive explicit tail bounds for the geometric 
mean estimator and establish an analog of the Johnson-Lindenstrauss (JL) lemma for dimension 
reduction in l\, which is weaker than the classical JL lemma for dimension reduction in l<z. 

Asymptotically, both the sample median estimator and the geometric mean estimators are about 
80% efficient compared to the maximum likelihood estimator (MLE). We analyze the moments of 
the MLE and propose approximating the distribution of the MLE by an inverse Gaussian. 

Keywords: Dimension reduction, l\ norm, Cauchy Random projections, JL bound 



1. Introduction 

This paper focuses on dimension reduction in l\, in particular, on the method based on Cauchy 
random projections (Indyk, 2000), which is special case of linear random projections. 

The idea of linear random projections is to multiply the original data matrix A G M nxD with 
a random projection matrix R G M Dxfe , resulting in a projected matrix B = AR G R nxfe . If 
k -C min(n, D), then it should be much more efficient to compute certain summary statistics (e.g., 

1. Revised February 1, 2008. The original version, titled Practical Procedures for Dimension Reduction in h, is avail- 
able as a technical report in Stanford Statistics achive (report No. 2006-04, June, 2006). 
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pairwise distances) from B as opposed to A. Moreover, B may be small enough to reside in physical 
memory while A is often too large to fit in the main memory. 

T he choice of the random projection matrix R depends on which norm we would like to work 
with. Indvk (2000) proposed constructing R from i.i.d. samples of p -stable distributio ns, for di- 



mension reduction in l p (0 < p < 2). In the stable distribution family (Zolotarev, 1986), normal is 
2-stable and Cauchy is 1-stable. Thus, we will call random projections for 1% and l\, normal random 
projections and Cauchy random pr ojections, respect ively. 

In normal random projections dVempalal l2004) . we can estimate the original pairwise I2 dis- 
tances of A directly using the corresponding I2 dist ances of B (up to a normalizing constant). 
Furthermore, the Johnson-Lindenstrauss (JL) lemma ( Johnso n and Lindenstraussl . 19841) provides 



the performance guarantee. We will review normal random projections in more detail in Section|2] 
For Cauchy random projections, we should not use the l\ distance in B to approximate the 
original l\ distance i n A, as the Cauchy distribution does not even have a finite first moment. The 



impossibility results (Brinkman and Charikar, 2003; Lee and Naor, 2004; Brinkman and Charikar 



l2005h have proved that one can not hope to recover the £1 distance using linear projections and linear 
estimators (e.g., sample mean), without incurring large errors. Fortunately, the impossibility results 
do not rule out nonlinear estimators, which may be still useful in certain applications in data stream 
co mputation, inf ormation retrieval, learning, and data mining. 

Indvkl (2000) proposed using the sample median (instead of the sample mean) in Cauchy ran- 



dom projections and described its application in data stream computation. In this study, we provide 
three types of nonlinear estimators: the bias-corrected sample median estimator, the bias-corrected 
geometric mean estimator, and the bias-corrected maximum likelihood estimator. The sample me- 
dian estimator and the geometric mean estimator are asymptotically equivalent (i.e., both are about 
80% efficient as the maximum likelihood estimator), but the latter is more accurate at small sample 
size k. Furthermore, we derive explicit tail bounds for the bias-corrected geometric mean estimator 
and establish an analog of the JL Lemma for dimension reduction in l\. 

This analog of the JL Lemma for l\ is weaker than the classical JL Lemma for I2, as the 
geometric mean estimator is a non-convex norm and hence is not a metric. Many efficient al- 
gorithms, such as some sub -linear time (using super-linear memory) nearest neighbor algorithms 



( Shakhna rovich et al! l2005). rely on the metric properties (e.g., the triangle inequality). Neverthe- 
less, nonlinear estimators may be still useful in important scenarios. 

• Estimating l\ distances online 

The original data matrix A 6 IR TlxD requires 0(nD) storage space; and hence it is often 
too large for physical memory. The storage cost of all pairwise distances is 0{n 2 ), which 
may be also too large for the memory. For example, in information retrieval, n could be the 
total number of word types or documents at Web scale. To avoid page fault, it may be more 
efficient to estimate the distances on the fly from the projected data matrix B in the memory. 

• Computing all pairwise l\ distances 

In distance-based clustering and classification applications, we need to compute all pairwise 
distances in A, at the cost of time 0(n 2 D). Using Cauchy random projections, the cost can 
be reduced to 0(nDk + n 2 k). Because k -C min(n, D), the savings could be enormous. 

• Linear scan nearest neighbor searching 

We can always search for the nearest neighbors by linear scans. When working with the pro- 
jected data matrix B (which is in the memory), the cost of searching for the nearest neighbor 



2 



Cauchy Random Projections 



for one data point is time 0(nk), which may be still significantly faster than the sub-linear 
algorithms working with the original data matrix A (which is often on the disk). 

We briefly comment on coordinate sampling, another strategy for dimension reduction. Given 
a data matrix A G M nxD , one can randomly sample k columns from A and estimate the sum- 
mary statistics (including l\ and I2 distances). Despite its simplicity, there are two major disad- 
vantages in coordinate sampling. First, there is no performance guarantee. For heavy-tailed data, 
we may have to choose k very large in order to achieve sufficient accuracy . Second, large datasets 
are often highly sparse, for example, text da ta ( Dhillon and Modha 1 200 lb a nd m arket-basket data 



(Aggarwal and Wolf, 1999; Istrehl an d Ghosh, 2000V ll.i and Churchl fc005h and lUsLalJ £o06a) 



provide an alternative coordinate sampling strategy, called Conditional Random Sampling (CRS), 
suitable for sparse data. For non-sparse data, however, methods based on linear random projections 
are superior. 

The rest of the paper is organized as follows. Section previews linear random projections. 
Section |3] summarizes the main results for three types of nonlinear estimators. Section 0] presents 
the sample median estimators. Section |5] concerns the geometric mean estimators. Section |6] is 
devoted to the maximum likelihood estimators. Section ^concludes the paper. 

2. Introduction to Linear Random Projections 

We give a review on linear random projections, including normal and Cauchy random projections. 

Denote the original data matrix by A G R nx£ \ i.e., n data points in D dimensions. Let 
{uj}f =1 G R D be the ith row of A. Let R G M Dxfc be a random matrix whose entries are i.i.d. 
samples of some random variable. The projected data matrix B = AR G jR nxfe . Denote the entries 
of R by {rij}f =1 k - =l and let {vj}f =1 G M. k be the ith row of B. Then vi = R T iij, with entries 



Vi 



Rj Ui, i.i.d. j = 1 to k, where Hj is the jth column of R. 



For simplicity, we focus on the leading two rows, u\ and 112, in A, and the leading two rows, v± 
and V2, in B. Define {xj} k =l to be 

D 

x j = v iJ ~ v 2,j = '^2r ij - u 2 ,i) , j = l,2,...,k (1) 

i=i 



If we sample rij i.i.d. from a stable distribution (Zolotarev, 1986: llndykl 2000h . then x/s are 



also i.i.d. samples of the same stable distribution with a different scale parameter. In the family of 
stable distributions, normal and Cauchy are two important special cases. 

2.1 Normal Random Projections 

When rij is sampled from the standard normal, i.e., ~ iV(0, 1), i.i.d., then 

D / D \ 

i=l \ i=l J 

because a weighted sum of normals is also normal. 
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Denote the squared 1% distance between u\ and U2 by d\ 2 
We can estimate di 2 from the sample squared I2 distance: 



\U\ - U2W2 



1 ~ Ya=1 \ U l,i ~ u 2,i\ 2 - 



1 k 



Xj. 



(3) 



j'=i 



It is easy to show that (e.g., JVemnalaLEoollLi et all E)06b)) 

2 



E dt, = d i2 , 



Var 



Pr 



dj 2 - ^2 



A; 

> e^ 2 ) < 2exp I --e 2 + -e 



e > 



(4) 
(5) 



We would like to bound the error probability Pr 



total 



n(n— 1) 



du — d\ 2 



> edi 2 ) by 5. Since there are in 



< \ pairs among n data points, we need to bound the tail probabilities simultaneously 



for all pairs. By the Bonferroni union bound, it suffices if 

,2 



n 



Pr 



di 2 - d. 



> ed l2 ) < 5. 



Using (|5Jl, it suffices if 



ft ' ( k o k o 

2 log n — log 5 



< 5 



>k > 



e 2/ 4 _ £ 3/ 6 



(6) 

(V) 
(8) 



Therefore, we obtain one version of the JL lemma: 

Ifk> 2 ly^L^Q > then with probability at least 1 — 5, the squared I2 distance between any pair 
of data points (among n data points) can be approximated within 1 ± e fraction of the truth, using 

the squared I2 distance of the projected data after no rmal random projections. 

M any versions of the JL lemma have been proved Jjqhnson and LindenstraussLll984l:lFrarikl and MaeharaL 
19871 : llndvk and MotwaniL Il998l: lArriaga and VempalaL Il999l: basgupta and GuptaL 120031 : llndvkL 
200fi l200ll: lAchlioptasL l2003l: lArriaga and VempalaL l2004 kilon and ChazelldbOOfih . 

Note that we do not have to use r« ~ N(0, 1) for dimension reduction in l 2 . For example, 



we can sample r ij from some sub-Gaussian distributions ( Indvk and Naor , 2006b . in particular, the 



following sparse projection distribution: 



1 with prob. ^ 
with prob. 1 - 
- 1 with prob. 77- 



(9) 



When 1 < s < 3. lAchlioptasl J2003h proved the JL lemma for the above sparse proiect ion, which 
can also be shown by sub-Gaussian analysis ( Li et alll2006c ). Recently. iLi et all d200 6d) proposed 
very sparse random projections using s = VD in (|9j, based on two practical considerations: 



D should be very large, otherwise there would be no need for dimension reduction. 
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• The original 1% distance should make engineering sense, in that the second (or higher) mo- 
ments should be bounded (otherwise various term-weighting schemes will be applied). 

Based on these two practical assumptions, the projected data are asymptotically normal at a fast 
rate of convergence when s = \f~D. Of course, very sparse random projections do not have worst 
case performance guarantees. 

2.2 Cauchy Random Projections 

In Cauchy random projections, we sample m i.i.d. from t he standard Cauchy distribution, i.e., 
~ C(0, 1). By the 1 -stability of Cauchy dZolotarevLfl98^ . we know that 



D 



v i, 



c o,^Ki-«2,i| ■ (io) 



That is, the projected differences Xj = vij — V2j are also Cauchy random variables with the scale 
parameter being the l\ distance, d=\u\ — u%\ = Yli=i \ u i,i ~ n 2,i|> m me original space. 
Recall that a Cauchy random variable z ~ C(0, 7) has the density 

7 1 

f( z ) = TT~^> ^ >0 ' -oo<z<oo (11) 

7T Z z + 7 Z 

The easiest way to see the 1 -stability is via the characteristic function, 

E (exp( % /^Tz 1 t)) = exp (- 7 |t|) , (12) 

E exp \/-Lty]ciZi = exp -7 V] \q\t , (13) 




for z\, z-i, zr>, i.i.d. C(0,7), and any constants c\, C2, cp. 

Therefore, in Cauchy random projections, the problem boils down to estimating the Cauchy 
scale parameter of C(0,d) from k i.i.d. samples Xj ~ C(0, d). Unfortunately, unlike in normal 
random projections, we can no longer estimate d from the sample mean (i.e., | Ylj=i \ x j\) because 



E (Xj) = 00. 



Although the impossibility results (Lee and Naoi . 20041 : iBrinkma n and Charikail l2005h have 



ruled out estimators that are metrics, there i s enough in formation to recover d from k samples 
{xj}j =1 , with a high accuracy. For example. Ilndvkl (I2000h proposed using the sample median as 
an estimator. The problem with the sample median estimator is the inaccuracy at small k and the 
difficulty in deriving explicit tail bounds needed for determining the sample size k. 

This study focuses on deriving better estimators and explicit tail bounds for Cauchy random 
projections. Our main results are summarized in the next section, before we present the detailed 
derivations. Casual readers may skip these derivations after Section [5] 

3. Main Results 

We propose three types of nonlinear estimators: the bias-corrected sample median estimator (d me>c ), 
the bias-corrected geometric mean estimator (d grnjC ), and the bias-corrected maximum likelihood 
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estimator ((Imle,c)- dme,c and d gm ,c are asymptotically equivalent but the latter is more accurate at 
small sample size k. In addition, we derive explicit tail bounds for d gmfi , from which an analog of 
the Johnson-Lindenstrauss (JL) lemma for dimension reduction in l\ follows. Asymptotically, both 

d me .c. and d 



gm,c 



are ^ « 80% efficient compared to the maximum likelihood estimator (Imle,c- 
We propose accurate approximations to the distribution and tail bounds of cImle,^ while the exact 
closed-form answers are not attainable. 

3.1 The Bias-corrected Sample Median Estimator 

Denoted by d mejC , the bias-corrected sample median estimator is 

dry). P. 



where 



dr. 



d me =mediaa(\xj\,j = 1,2,..., k) 
" l (2m + 1)! /vr 



tan (J^tj (t-t 2 ) m dt, k = 2m+l 



Here, for convenience, we only consider k = 2m + 1, m = 1, 2, 3, ... 
Some key properties of d me ,c- 

• E [d me ,^j = d, i.e, d me ,c is unbiased. 

• When k > 5, the variance of d me ,c is 



Var 



i^dme,c^ 



d 1 



( 



{2m + i y-(^t^^t){t-t^) m dt) 2 ) 



Var [d meyC j = oo if k = 3. 
• As A; — ► oo, c2 mejC converges to a normal in distribution 



3.2 The Bias-corrected Geometric Mean Estimator 

Denoted by d gm , c , the bias-corrected geometric mean estimator is defined as 



(14) 

(15) 
(16) 



fc > 5 (17) 



(18) 



(19) 



Important properties of d gmfi include: 
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• This estimator is a non-convex norm, i.e., the l p norm with p — ► 0. 

• It is unbiased, i.e., E (d gm ,c^ = d. 

• Its variance is (for k > 2) 



V- (<W.) = [ - 1 ) = TT + 32 F + O J . (20) 



• For < e < 1, its tail bounds can be represented in exponential forms 

J2 



Pr 



(d gm ,c ~ d > ed) < exp ^-fc ^ 8 ^ e +e ^ ) (21) 
Pr (<W " d < -«*) < ex P (-* ) , fc > ^ (22) 

• These exponential tail bounds yield an analog of the Johnson-Lindenstrauss (JL) lemma for 
dimension reduction in l±: 

Ifk> 8( ' 2 '°/(" + 1 °) g ' 5 ' > > YEl' t ^ ien Wlf ^ probability at least 1 — 5, one can recover the original 
l\ distance between any pair of data points (among all n data points) within lie (0 < e < 1) 
fraction of the truth, using d grritC , i.e., \d gm>c — d\ < ed. 

3.3 The Bias-corrected Maximum Likelihood Estimator 

Denoted by <1mle,c, the bias-corrected maximum likelihood estimator is 

dMLE,c = d-MLE ^1 - -j^j , (23) 
where duhE solves a nonlinear MLE equation 

-J^ + j^ 2iMLE = 0. (24) 
duhE j=1 x 2 + d 2 MLE 

Some properties of cImle,c' 

• It is nearly unbiased, E (dMLE,cj = d + O (p-)- 

• Its asymptotic variance is 

72 qj2 



Var 



(W, c )=- + lx + 0^j, (25) 



. Var(j MLfi , e ) g Var( ( ?MLiS,c) 8 r. / 8 _ ontWN 

1 - e -' T71 ^ 7, n \ > 7?' as k > co. ~ OU/o; 
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Its distribution can be accurately approximated by an inverse Gaussian, at least in the small 
deviation range. Based on the inverse Gaussian approximation, we suggest the following 
approximate tail bound 

e 2 /(l+f ' 

))' 

which has been verified by simulations for the tail probability > 1CT 10 range 



Pr ( \d M LE,c -d\> ed) < 2exp I ~ 2 ^ 3 k 1 . < e < 1, (26) 



4. The Sample Median Estimators 

Recall in Cauchy random projections, B = AR, we denote the leading two rows in A by u\, u 2 
6 M. D , and the leading two rows in B by v\, v% G M. k . Our goal is to estimate the l\ distance 
d=\u\- u 2 \ = Ya=i Im. T - tta J from {xj} k j=1 , xj = vij - v 2 j ~ C(0, d), i.i.d. 

It is easy to show (e.g.. Ilndvkl (12000)') that the population median of \xj \ is d. Therefore, it is 



natural to consider estimating d from the sample median, 

d me = median{|xj|, j = 1,2, ...,&}. (27) 

As illustrated in the following lemma (proved in Appendix EJ, the sample median estimator, 
d me , is asymptotically unbiased and normal. For small samples (e.g., k < 20), however, d me is 
severely biased. 

Lemma 1 The sample median estimator, d me , defined in (12 7t . is asymptotically unbiased and nor- 
mal 

Vk {dme -d)=^>N ^0, ^d 2 ^j (28) 
When k = 2m + 1, m = 1, 2, 3, the r th moment of d me can be represented as 



(ml) 2 \2 



E(d me ) =d r (l ,\ 2 V " tan r ($t) (t-t 2 ) m dt) , m>r (29) 



If m < r, then E I d me I = oo 



For simplicity, we only consider k = 2m + 1 when evaluating E ( d r , 
Once we know E ( d me ) , we can remove the bias of d me using 



; _ d me 

Ome 



where the bias correction factor b me is 



K[d me ) f i ( 2m + i)! s „ 



d J (m!) 2 V2 
b me can be numerically evaluated and tabulated, at least for small k. 



tan(-t) (t-t 2 ) dt. (31) 



2. It is possible to express b me as an infinite sum. Note that (2 ( "+y )! (t - t 2 ) m , < t < 1, is the probability density 
of a Beta distribution Beta(m + 1, m + 1). 
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Obviously, d me>c is unbiased, i.e., E [d mefi 
( 



Var [d 



d 2 



d. Its variance would be 

(ml) 2 Jo tan 2 (ft) (t - t 2 ) m dt 



V 



(2m + 1)! 



o 1 tan (ft) (t-t 2 ) m dt 



/c = 2m + 1 > 5 (32) 



N(0,^d 2 



Of course, d 3mjC and d gm are asymptotically equivalent, i.e., \fk \ d m e,c — d 

Figure^plots b me as a function of k, indicating that d me is severely biased when k < 20. When 
k > 50, the bias becomes negligible. Note that, because b me > 1, the bias correction not only 
removes the bias of d me but also reduces its variance. 



1.7 

h 1-6 

3 1.5 

■B 1.4 

o 

g 1.3 
o 

° 1 2 

in 1.1 
1 



' -" MOOOM gti 



Figure 1 : The bias collection factor, b m , 
bias is negligible. Note that b n 



5 10 15 20 25 30 35 40 45 50 
Sample size k 

in d3TT >. as a function of k 
P = oo when k = 1. 



2m + 1. After > 50, the 



The sample median is a special case of sample quantile e stimators ( Fama and Roll 1968L 1971 ). 
For example, one version of the quantile estimators given bv lMcCullochl dl986h would be 



■>' 



75 



■V 



25 



2.0 



(33) 



where \x\ 75 and \x\ 25 are the .75 and .25 sample quantiles of {|xj|}^ =1 , respectively. 

Our simulations indicate that d me actually slightly outperforms (i or . This is not surprising. <i or 
works for any Cauchy distribution whose location parameter does not have to be zero, while d me 
takes advantage of the fact that the Cauchy location parameter is always zero in our case. 



5. The Geometric Mean Estimators 

This section derives estimators based on the geometric mean, which are more accurate than the 
sample median estimators. The geometric mean estimators allow us to derive tail bounds in ex- 
plicit forms and (consequently) an analog of the johnson-Lindenstrauss (JL) lemma for dimension 
reduction in l\. 

Recall, our goal is to estimate d from k i.i.d. samples Xj ~ C(0, d). To help derive the ge- 
ometric mean estimators, we first study two nonlinear estimators based on the fractional moment, 
i.e., E(|x| A ) (|A| < 1) and the logarithmic moment, i.e, E (log(|a;|)), respectively, as presented in 
Lemma|2l See the proof in Appendix iBl 
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Lemma 2 Assume x ~ C(0, d). Then 

e(\x\ x )= f |A|<1 (34) 

V / COS(A7T/2) 

E(\og(\x\)) =log(d), (35) 

%r(log(|a;|)) = ^, (36) 

/rom which we can derive two biased estimators of d from k Ltd. samples xj ~ C(0, d): 

l/X 

d\ = I f > ; |x 7 | A cos(A^/2) I , |A| < 1, (37) 



i^|^fcos(A7r/2)j 
4> 9 = exp I -^log(|x 3 |) I , 



w/jo^e variances are, respectively, 



(38) 



MrfA)=^^ + o(A). IAI < 1/2 (39) 



ir 2 d 2 / 1 

Var ( d Zo „ ) = _— + O — ] . (40) 



k A 2 cos(A7r) \k 2 

7T 2 d 2 / 1 

'"■"J = ^F + lF 

r/jg term ^r~r^ decreases with decreasing |A|, reaching a limit 

sin 2 (Avr/2) ^ 2 
lim x9 ' ( = — . (41) 

A^o A 2 cos(A7r) 4 

of/jer vvort/^, the variance of d\ converges to that of ' di og as | A| approaches zero. 

Note that ci; 09 can in fact be written as the geometric mean: 

k 

dio g = d gm = H\x j \ 1 / k . (42) 

3=1 

d\ is a non-convex norm because A < 1. d gm is also a non-convex norm (the l\ norm as 
A — * 0). Both d\ and d 9m do not satisfy the triangle inequality. 

We propose d grriiC , the bias-corrected geometric mean estimator. Lemma[3]derives the moments 
of d gm c , proved in Appendix ICl 

Lemma 3 

k 

4m,c = COS fc (^)[]|x i | 1 ^, k>l (43) 
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is unbiased, with the variance ( valid when k > 2 ) 

MM-'(5$-)-" + w +0 (*)- - 

The third and fourth central moments are (for k > 3 and k > 4, respectively) 

E (d gm , c - E (4 m , c )) 3 = ^ + O (p) (45) 
E(d gm>c -E[d gm>c )) =—- + 0l Y A. (46) 



The higher (third or fourth) moments may be useful for approximating the distribution of d gm ,c- 
In Section|6l we will show how to approximate the distribution of the maximum likelihood estimator 
by matching the first four moments (in the leading terms). We could apply the similar technique to 
approximate d gm ,c- Fortunately, we do not have to do so because we are able to derive the exact tail 
bounds of d gmfi in Lemma|31 which is proved in Appendix ID1 



Lemma 4 



cos * f — 1 

PrK m , c > (l + e)d) < ,^ K2kJ , e>0 (47) 



cos fc (1 + *)* 



where 



tl = ^ tan" 1 ( (log(l +e)-k log cos (^) ) ^) • (48) 

Pr ( <W < (1 - e)d) < ? \ ~ £j 2 , < e < 1, k > J- (49) 

^ cos* f )cob*5(£) 8 £ 



where 



t * = ^tan- 1 ((-log(l - e) + A; log cos (^)) ^ . (50) 
#y restricting < e < 1, the tail bounds can be written in exponential forms: 

Pr (<W > (1 + e)d) < exp j^-fc^^-y^ (5 1) 

/ 2 \ 2 

Pr (<W < (1 - e)d) < exp ( -fe ^ ^ J , ft > ^7 (52) 



1 + e / ~ 1.5e 
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An analog of the JL bound for l\ follows from the exponential tail bounds d5TT i and d52b . 

Lemma 5 Using d gmc with k > 8 ^ 2 ^°^" + 1 £ ° g<? ^ > then with probability at least 1 — 5, the l\ 
distance, d, between any pair of data points (among n data points), can be estimated with errors 
bounded by ±ed, i.e., \d gm ^ c — d\ < ed. 

Remarks on Lemma HJ (1) We can replace the constant "8" in Lemma |5] with better (i.e., 
smaller) constants for specific values of e. For example, If e = 0.2, we can replace "8" by "5". See 
the proof of LemmaHJ (2) This Lemma is weaker than the classical JL Lemma for dimension reduc- 
tion in li as reviewed in Section 2.1. The classical JL Lemma for I2 ensures that the I2 inter-point 
distances of the projected data points are close enough to the original I2 distances, while Lemma|5] 
merely says that the projected data points contain enough information to reconstruct the original l\ 
distances. On the other hand, the geometric mean estimator is a non-convex norm; and therefore 
it does contain some information about the geometry. We leave it for future work to explore the 
possibility of developing efficient algorithms using the geometric mean estimator. 

Figure |2] presents the simulated histograms of d gm>c for d = 1, with k = 5 and k = 50. The 
histograms reveal some characteristics shared by the maximum likelihood estimator we will discuss 
in the next section: 

• Supported on [0, 00), d gm , )C is positively skewed. 

• The distribution of d gmfi is still "heavy-tailed." However, in the region not too far from the 
mean, the distribution of d gm ^ c may be well captured by a gamma (or a generalized gamma) 
distribution. For large k, even a normal approximation may suffice. 




(a) k = 5 (b) k = 50 

Figure 2: Histograms of d gnhC , obtained from 10 6 simulations. At least in the range not too far 
from the mean, the distribution of d gril:C resembles a gamma and also resembles a normal 
when k is large enough. 



Figure |3] compares d gm , )C with the sample median estimators d me and d me ^ c , in terms of the 
mean square errors. d gmfi is considerably more accurate than d me at small k. The bias correction 
significantly reduces the mean square errors of d me . 
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Figure 3: The ratios of the mean square errors (MSN), MSE ( A rfme ) anc [ 

6 MSE(d 9m , c ) MSE(d gm , c ) 



, demonstrate that 



the bias-corrected geometric mean estimator d gm)C is considerably more accurate than 
the sample median estimator d me . The bias correction on d me considerably reduces the 
MSE. Note that when k = 3, the ratios are oo. 



6. The Maximum Likelihood Estimators 



This section is devoted to analyzing the maximum likelihood estimators (MLE), which are "asymp- 
totically optimum." In comparisons, the sample median estimators and geometric mean estimators 
are not optimum. Our contribution in this section includes the higher-order analysis for the bias and 
moments and accurate closed-from approximations to the distribution of the MLE. 

The method of maximum likelihood is widely used. For example, iLi et al. (2006b) applied the 
maximum likelihood method to normal random projections and provided an improved estimator of 
the li distance by taking advantage of the marginal information. 

The Cauchy distribution is often considered a "challe nging" examp l e because of the " multiple 
roots" problem when estimating the location parameter ( Ba rnettL H%fl : lHaas et all Il970h . In our 
case, since the location parameter is always zero, much of the difficulty is avoided. 

Recall our goal is to estimate d from k i.i.d. samples Xj ~ C(0, d),j = 1, 2, k. The log joint 
likelihood of {xj}j =1 is 



L(xi,x 2 ,...x k ;d) = fclog(d) - Hog(7r) - ^log(s| + d 2 ), 

3=1 



(53) 



whose first and second derivatives (w.r.t. d) are 



L\d) 



L"(d) 



k >. - 

7i ~ 2^ 



2d 



k 



d 2 ' 



2d 2 



k_ ^ 2x] 

d 2 Z^(x 2 +d 2 f 



L'{d) 



4 E 

3=1 



x: 



+ d 2 f 



(54) 
(55) 
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The maximum likelihood estimator of d, denoted by cImle, is the solution to L'(d) = 0, i.e., 



dMLE j=1 Xj + d MLE 

Because L"{(1mle) < 0, duLE indeed maximizes the joint likelihood and is the only solution to 
the MLE equation d56b . Solving d56l numerically is not difficult (e.g., a few iterations using the 
Newton's method). For a better accuracy, we recommend the following bias-corrected estimator: 

dMLE,c = d-MLE f 1 - -j^j ■ (57) 

Lemma |6] concerns the asymptotic moments of dMLE and cImle,c, proved in Appendix|E| 

Lemma 6 Both dMLE and <1mle,c are asymptotically unbiased and normal. The first four moments 
of d mle are 

E(d MLE -d)=i + o[^ (58) 
2d 2 7d 2 „ / 1 



Var{d MLE )= — + w + 0^ ¥ ) (59) 

E [d M LE - E(d M LE)) 3 = + O (J^) (60) 

/ „ „ \ 4 \2d 4 222d 4 / 1 \ 

=^ + ^^ + °( v 7^J (61) 



The first four moments of dMLE, c are 



E{du LL> -d)=o(j^) (62) 

2d 2 3d 2 ( 1 \ 

M<W>) = — + ^ + °( v fc3j (63) 

^ ( d>MLE,c ~ E(d M LE,c)) 3 = + O ( (64) 

/ , , n\ 4 12d 4 186d 4 ^ / 1 \ 

E ( d ML E,c - E(d M LE,c)) =^T + -gr + O ( ¥ J (65) 



The order O (r) term of the variance, i.e., is known, e.g., (Haa s et allll970h . We derive the 
bias-corrected estimator, duLF,., 



, ,., an d the hig her order moments using stochastic Taylor expansions 
< Battled Fl95l lShenton and BowmaJ [l963l lFerrari et a l~l ll996l :ICvs neiros et all EoOlll 

We will propose an inverse Gaussian distribution to approximate the distribution of cImle,c, by 
matching the first four moments (at least in the leading terms). 
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6.1 A Numerical Example 



The maximum likelihood estimators are tested on MSN Web crawl data, a term-by-document matrix 
with D = 2 16 Web pages. We conduct Cauchy random projections and estimate the l\ distances 
between words. In this experiment, we compare the empirical and (asymptotic) theoretical mo- 
ments, using one pair of words. Figure 0] illustrates that the bias correction is effective and these 
(asymptotic) formulas for the first four moments of cLmle,c in Lemma |6] are accurate, especially 
when k > 20. 

1 
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Figure 4: One pair of words are selected from an MSN term-by-document matrix with D = 2 16 
Web pages. We conduct Cauchy random projections and estimate the l\ distance between 
one pair of words using the maximum likelihood estimator (Lmle and the bias-corrected 
version Amle,c- Panel (a) plots the biases of cImle and cLmle,c, indicating that the 
bias correction is effective. Panels (b), (c), and (d) plot the variance, third moment, and 
fourth moment of (Imle,c, respectively. The dashed curves are the theoretical asymptotic 
moments. When k > 20, the theoretical asymptotic formulas for moments are accurate. 



6.2 Approximation Distributions 

Theoretical analysis on the exact distribution of a maximum likelihood estimator is difficult. 3 In 
statistics, the standard approach is to assume normality, which, however, is quite inaccurate. The 

3. In fac t, conditional on the observations xi, X2, Xk, the distribution of (Lmle can be exactly characte rized jFished. 
11934 . ILawlessH 19721) studied the conditional confidence interval of the MLE. Later, Hinklev 1 1978) proposed the 
normal approximation to the exact conditional confidence interval and showed that it was superior to the uncondi- 
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so-called Edgeworth expansion' improves the normal approxim ation by matching higher moments 
I Felleii Il97ll : fBhattacharv a and Ghoshl Il978l : ISeverini KOOOV For example, if we approximate 



the distribution of duLE,c using an Edgeworth expansion by matching the first four moments of 
<Imle,c derived in Lemma|6l then the errors will be on the order of O (/c _3//2 ) . However, Edgeworth 
expansions have some well-known drawbacks. The resultant expressions are quite sophisticated. 
They are not accurate at the tails. It is possible that the approximate probability has values below 
zero. Also, Edgeworth expansions consider the support is (— oo, oo), while d,MLE,c is non-negative. 

We propose approximating the distributions of cImle,c directly using some well-studied com- 
mon distributions. We will first consider a gamma distribution with the same first two (asymptotic) 
moments of <Imle,c- That is, the gamma distribution will be asymptotically equivalent to the normal 
approximation. While a normal has zero third central moment, a gamma has nonzero third central 
moment. This, to an extent, speeds up the rate of convergence. Another important reason why a 
gamma is more accurate is because it has the same support as cImle,c, i-e., [0, oo). 

We will furthermore consider a generalized gamma distribution, which allows us to match the 
first three (asymptotic) moments of cImle,c- Interestingly, in this case, the generalized gamma ap- 
proximation turns out to be an inverse Gaussian distribution, which has a closed-form probability 
density. More interestingly, this inverse Gaussian distribution also matches the fourth central mo- 
ment of dMLE,c in the O (r?) term and almost in the O (p-) term. By simulations, the inverse 
Gaussian approximation is highly accurate. 

Note that, since we are interested in the very small (e.g., 10~ 10 ) tail probability range, O (/c _3//2 ) 
is not too meaningful. For example, k~ 3 / 2 = 10 -3 if k = 100. Therefore, we will have to rely on 
simulations to assess the accuracy of the approximations. On the other hand, an upper bound may 
hold exactly (verified by simulations) even if it is based on an approximate distribution. 

As the related work, iLi et al. (2006e) applied gamma and generalized gamma approximations 



to model the performance measure distribution in some wireless communication channels using 
random matrix theory and produced accurate results in evaluating the error probabilities. 

6.2. 1 The Gamma Approximation 

The gamma approximation is an obvious improvement over the normal approximation. 5 A gamma 
distribution, G(a, j3), has two parameters, a and f3, which can be determined by matching the first 
two (asymptotic) moments of cImle,c- That is, we assume that cLmle,c ~ G(a, j3), with 



n2 2 d 2 3d 2 1 n 2d 3d 

ap = d, a/3 = —j— + T2"' =^ a = 1 T' P = T + T^r (66) 

^ + p- k k 



tional normality approximation. Unfortunately, we can not take advantage of the conditional analysis because our 
goal is to determine the sample size k before seeing any samples. 

4. The so-called Saddlepoint approximation in general improves Edgeworth expansions lJensenlll99l . often very con- 
siderably. Unfortunately, we can not apply the Saddlepoint approximation in our case (at least not directly), because 
the Saddlepoint approximation needs a bounded moment generating function. 

5. In normal random projections for dimension reduction in h, the resultant estimator of the squared li distance has a 
chi-squared distribution (e.g., I Vemoala, 2004, Lemma 1.3)), which is a special case of gamma. 
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Assuming a gamma distribution, it is easy to obtain the following Chernoff bounds 6 : 

Pr (d M LE,c > (1 + e)d) < exp (-a (e - log(l + e))) , e > (67) 
Pr (d M LE,c < (1 - e)d) < exp (-a (-e - log(l - e))) , < e < 1, (68) 



where we use < to indicate that these inequalities are based on an approximate distribution. 

Note that the distribution of dMj.F.ld ( and hence dMLE,c/d) is only a function of k as shown 
in dAntle and BairllT969l : lHaas et all 1970h . Therefore, we can evaluate the accuracy of the gamma 
approximation by simulations with d = 1, as presented in Figure |5] 





0.8 1 



10 — Empirical 

Gamma Bound 

10" 



0.2 0.4 0.6 0.8 

e 



(a) (b) 
Figure 5: We consider k = 10, 20, 50, 100, 200, and 400. For each k, we simulate standard 
Cauchy samples, from which we estimate the Cauchy parameter by the MLE <1mle,c and 
compute the tail probabilities. Panel (a) compares the empirical tail probabilities (thick 
solid) with the gamma tail probabilities (thin solid), indicating that the gamma distribution 
is better than the normal (dashed) for approximating the distribution of <1mle,c- Panel (b) 
compares the empirical tail probabilities with the gamma upper bound d67b + d68l . 



Figure |3a) shows that both the gamma and normal approximations are fairly accurate when the 
tail probability > 10~ 2 ~ 10 -3 ; and the gamma approximation is obviously better. 

Figure |3b) compares the empirical tail probabilities with the gamma Chernoff upper bound 
d67l + d68t . indicating that these bounds are reliable, when the tail probability > ~ 10 -6 . 

6.2.2 The Inverse Gaussian (Generalized Gamma) Approximation 

The distribution of duLE,c can be well approximated by an inverse Ga ussian distribution, which 
is a s pecial case of the three-parameter generalized gamma distribution ( Houg aardl. 19861 : Gerbei , 



1991), denoted by GG(a, (3, rf). Note that the usual gamma distribution is a special case with rj = 1. 



6. Using the Chernoff inequality jChemofl[l952h . we bound the tail probability by Pr (Q > z) = Pr (e Qt > e zt ) < 
E (e^*) e _zt ; and we then choose t that minimizes the upper bound. 
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If z ~ GG(a, j3, rj), then the first three moments are 

E(z) = a/3, Var(z)=a/3 2 , E (z - E(z)f = a/3 3 (l + rj). 
We can approximate the distribution of cImle,c by matching the first three moments, i.e., 

12d 3 



ot(3 

from which we obtain 



a 



d, 



2,3' 
k 



2^_ 3^ 
k + k 2 ' 



2d 3d 



v = 2 + 



k 2 



(69) 



(70) 



(71) 



Taking only the leading term for rj, the generalized gamma approximation of dj^LE c would be 




(72) 



In general, a generalized gamma distribution does not have a closed-form density function al- 
though it always has a closed-from moment generating function. In our case, (1721 is actually an 
inverse Gaussian distribution, which has a closed-form density function. Assuming <1mle,c ~ 
IG(a,(3), with parameters a and /3 defined in (17 It . the moment generating function (MG F), the 
proba bility densit y function (PDF ), and cumulative density function (CDF) would be ( Seshadril 
I1993L Chapter 2) (lTweediel [l9?7lbh 7 

(73) 
(74) 



exp(a(l- (1-2/ft) 1 / 2 

(y/P- 




E ( exp(d M LE,ct) 
Pr(d M LE,c = y) 

Pr [d M LE,c < y 

' \\ y \ a P j 1 \ v y \ a P J 1 

I I rvrl /11 \ \ ~ I I rMf] / 11 \ \ 

(75) 

where <£(.) is the standard normal CDF, i.e., $>(z) = f z —^=e~~dt. Here we use = to indicate 
that these equalities are based on an approximate distribution. 

Assuming <1mle,c ~ IG(a, j3), then the fourth central moment should be 

4 - - ,2X2 



E duLE,c — E duLE 



15a/3 4 + 3(a/3 2 )' 
irJ /2d 3d, 

1M ' T + 7^) + 3 



2f_ 3f_ 



\2d" 156d 4 n 

+ -7^— + 



k 2 



k 3 



(76) 



7. The inverse Gaussian distribution was first noted as the distribution of the first passage time of the Brownian 
motion with a posit i ve drift. It has many interesting properties such as infinitely divisible. Two monographs 
IChhikara and Folksi 11984 ISeshadril \l9<m are devoted entirely to the inverse Gaussian distributions. For a quick 
reference, one can check http://mathworld.wolfram.com/InverseGaussianDistribution.html 
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Lemma|6]has shown the true asymptotic fourth central moment: 

E (d MLE , c - E (d MLE , c ) ) = _ + _ + ^_j. (77) 

That is, the inverse Gaussian approximation matches not only the leading term, ^pr, but also almost 
the higher order term, ls ^f 4 , of the true asymptotic fourth moment of cImle,c- 

Assuming duLE,c ~ IG(a>, (3), the tail probability of cImle,c can be expressed as 

Pr (d MLE . c > (1 + e)d) = d> L e JjO\ - e 2a $ (-(2 + e)J^) > e > (78) 

Pr(W lC < (l-e)d) =^(-e^^) +e 2Q $(-(2-e)y^) , < e < 1. (79) 
Assuming cImle,c ~ IG(a, 0), it is easy to show the following Chernoff bounds: 

Pr (d M LE,c > (1 + e)d) < exp (~ 2 ^ £ J . e > (80) 



ae 2 



Pr (d M £i?,c < (1 - e)d) < exp I - — — - ) , < e < 1. (81) 

To see d80b . Assume z ~ IG(a, j3). Then, using the Chernoff inequality: 

Pr > (1 + e)d) <E (zt) exp(-(l + e)dt) 

= exp (a ( 1 - (1 - 2/?t) 1/2 ) - (1 + e)dt) , 



whose minimum is exp (~ 2 (i+ £ ) ) > attained at i = ^1 — jj^p ) ^j- We can similarly show d8lT l. 
Combining d80l and (UTTl yields a symmetric bound 

Pr (\d ML E,c -d\>ed}j<2 exp l- fl^fi J , < e < 1 (82) 

Figure |6] compares the inverse Gaussian approximation with the same simulations as presented 
in Figure [5] indicating that the inverse Gaussian approximation is highly accurate. When the tail 
probability > 1CT 4 ~ 10~ 6 , we can treat the inverse Gaussian as the exact distribution of (Imle,c- 
The Chernoff upper bounds for the inverse Gaussian are always reliable in our simulation range (the 
tail probability > 10" 10 ). 

7. Conclusion 

It is well-known that the l\ distance is far more robust than the I2 di stance aga i nst "o utliers." 
There are numerou s success stori es of using the U distance, e.g., Lasso ( Tibshiranl. 1996b. LARS 



(Rfro n et al 112 0041 1 -norm SVM Jzhu et alll2003h . and Laplacian radial basis kernel Jchapelle et al 
ll999l:lFerecatu et allEtXMh . 



19 



Li, Hastie, and Church 




0.2 0.4 0. 

e 



10 



.10" 



* 10 4 
o 

-6 



10 



10 



-10 











k=400 ^ \ 




- \ 




- — Empirical 




---IG Bound 





0.2 0.4 0.6 0.8 

e 



(a) (b) 
Figure 6: We compare the inverse Gaussian approximation with the same simulations as presented 
in Figure |3] Panel (a) compares the empirical tail probabilities with the inverse Gaussian 
tail probabilities, indicating that the approximation is highly accurate. Panel (b) com- 
pares the empirical tail probabilities with the inverse Gaussian upper bound d80b+d8ll. 
The upper bounds are all above the corresponding empirical curves, indicating that our 
proposed bounds are reliable at least in our simulation range. 



Dimension reduction in the l\ norm, however, has been proved impossible if we use linear ran- 
dom projections and linear estimators. In this study, we propose three types of nonlinear estimators 
for Cauchy random projections: the bias-corrected sample median estimator, the bias-corrected 
geometric mean estimator, and the bias-corrected maximum likelihood estimator. Our theoretical 
analysis has shown that these nonlinear estimators can accurately recover the original l\ distance, 
even though none of them can be a metric. 

The bias-corrected sample median estimator and the bias-corrected geometric mean estimator 
are asymptotically equivalent but the latter is more accurate at small sample size. We have derived 
explicit tail bounds for the bias-corrected geometric mean estimator and have expressed the tail 
bounds in exponential forms. Using these tail bounds, we have established an analog of the Johnson- 
Lindenstrauss (JL) lemma for dimension reduction in l\, which is weaker than the classical JL 
lemma for dimension reduction in 1%. 

We conduct theoretic analysis on the bias-corrected maximum likelihood estimator (MLE), 
which is "asymptotically optimum." Both the sample median estimator and the geometric mean 
estimator are about 80% efficient as the MLE. We propose approximating its distribution by an 
inverse Gaussian, which has the same support and matches the leading terms of the first four mo- 
ments of the proposed estimator. Approximate tail bounds have been provide based on the inverse 
Gaussian approximation. Verified by simulations, these approximate tail bounds hold at least in the 
> 10~ 10 tail probability range. 

Although these nonlinear estimators are not metrics, they are still useful for certain applications 
in (e.g.,) data stream computation, information retrieval, learning and data mining, whenever the 
goal is to compute the l\ distances efficiently using a small storage space. 
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The geometric mean estimator is a non-convex norm (i.e., the l p norm as p — > 0); and therefore 
it does contain some information about the geometry. It may be still possible to develop certain 
efficient algorithms using the geometric mean estimator by avoiding the non-convexity. We leave 
this for future work. 
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Appendix A. Proof of Lemma U 

Assume x ~ C(0, d). The probability density function (PDF) and the cumulative density function 
(CDF) of | a; | would be 

Pr(|x| = z) = z>0 (83) 

2 z 

Pr(|x| < z) = -tan -1 -, z>0 (84) 

7T d 

The asymptotic normality of d me follows from the asymptotic results on sample quantiles (Shao, 
200l Theorem 5.10). 



Vk (d me -d)^NUUl- / (Pr(|x| = z)\ z=d ) 2 ^j = N (o, ^d 2 ) (85) 

The pro bability density of d me can be derived from the probability density of order statistics 
(Shaol 2003 . Example 2.9). For simplicity, we only consider k = 2m + 1, m = 1, 2, 



Pr(d me = z)= {2 ™t] )[ (Pr(|x| < z)) m (1 - Pr(|x| < z)) m Pr(|x| = z) 



(2m + 1)!/ 2 iz\ m ( 2 , z\ m 2d 1 

' tarT 1 - l--tan~ 1 - = (86) 



(m!) 2 \7T d) \ 7T d) 7T z 2 + d? 
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The r th moment of d me would be 



t- / , > # r (2m + l)!/2 ,z\ m / 2 , _ , ... 







' L^^^^it-tTdt, (87) 



o 



[ml 



by substituting t = - tan 1 |. 

When £ 1 - 0, tan (ft) oo, but t - t 2 = t(l - t) 0. Around t = 1 - 0, tan (ft) 



- + by the Taylor expansion. Therefore, in order for E ( d me ] < oo, we 



tan(f (1-t)) 

must have m > r. 

We complete the proof of Lemmaf^ 

Appendix B. Proof of Lemma |2| 

Assume x ~ C(0, d). The first moment of log(|x|) would be 



E log x = — / 2 

7 y + d 



1 [°° \og{d)y-V 2 | 1/2 log^y 1 / 2 



Wo y + 1 y + ! 

log(d), (88) 



with the help of the integral tables dGradshtevn and RvzhikLll994l 3.221.1, 4.251.1). 

Thus, given i.i.d. samples xj ~ C(0, d), j = 1, 2, fc, a nonlinear estimator of d would be 

/ 1 A \ 

(89) 



dj os = exp I - ^2 |) I 



We can derive another non linear estimator from E (|x| A ), |A| < 1. Using the integral tables 



(Gradshteyn and Ryzhik, 1994, 3.221.1), we obtain 



d x f°° yV 

-ay 



2d f°° y x 



n Jo y + i 

d A 

cos(A7r/2) ' 

from which a nonlinear estimator follows immediately 

l/A 



(90) 



i^|x/cos(Avr/2)j 



dx= I ^> Jx J | A cos(Avr/2) | , |A| < 1 (91) 
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Both nonlinear estimators di nc and d\ are biased. The leading terms of their variances can 
obtained by the Delta Method JShaoLl2003l Corollar y 1.1). 



With the help of (Gradshteyn and Ryzhik, 1994, 4.261.10), we obtain 



it 2 . „ vr 2 



E (log 2 (|x|)) = \og\d) + — , i.e., Var (log 2 (|x|)) 

Thus, 

By the Delta Method, the asymptotic variance of d\ og should be 

Var (d lo9 ) = * £ exp 2 (log(.)) + O ( ^ ) = ^ + O ( ^ ) . 



ITT 2 

jfe 4 ' 



Similarly, the asymptotic variance of d\ is 



w , , . d 2 sin 2 (A7r/2) _ / 1 , , , 



Var MaJ — > oo as |A| — ► ^- Var (d\J converges to Var \ di og j as A — > 0, because 

, sin 2 (Avr/2) vr 2 
hm— - — — = — . 

A^o A 2 cos(A7r) 4 

This completes the proof of Lemma[2] 



Appendix C. Proof of Lemma |3] 

Assume that x\, X2, Xk, are i.i.d. C(0, d). The estimator, d gm:C , expressed as 

k 

4- c =cos fc (^)ni^i i/fc 5 

i=i 

is unbiased, because, from Lemma|2j 

E(4 m , c )=co S fc (^)n E (i,/A) 
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The variance is 

Var(<w) =co S 2k (^-)HE(\x 3 \ 2 / k ) -d 




2 



(99) 



32 k 2 \k 3 



because 

cos 



2k i£) _ n , i 



cos fc 



+ 



f) V 2 2 VCOs(7t//c) 



1 7T 2 5 7T 4 / 1 

/lvr2 _5_7r^\ fe(fc-l) /Itt 2 5_J^Y 
+ V4fc 2 + 4 8fc 4 / + 2 V4fc 2+ 48&V + "' 

Some more algebra can similarly show the third and fourth central moments: 

E (cW - E (cW)) 3 = ^ + O (1) (102) 

e (4 m , c - e (<w)) 4 = ^ + (f) • (103) 

Therefore, we have completed the proof of Lemma|3] 
Appendix D. Proof of Lemma |4| 

This section proves the tail bounds for d gmfi . Note that d grriiC does not have a moment generating 

function because E [d gm ^ = oo if t > k. However, we can still use the Markov moment bound. 8 
For any e > and < t < k, the Markov inequality says 

E (dnm. ^\ n^ kt ( JL 

P 



9 m,c) COS"" I 3*, 



n^^^)W VB)";.y ' <104) 

which can be minimized by choosing the optimum t = £*, where 

t* = ^ tan -i^l og (l + e )_ A; logco S (^))^ . (105) 



8. In fact, even when the moment generating function does exist, for any positive random variable, the Markov moment 
bound is always sharper t h an the Chernoff bound, although the Chernoff bound will be in an exponential form. See 
IPhilips and Nelsonl il99l : lLugosil 120041) . 
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We need to make sure that < t* < k. t\ > because logcos(.) < 0; and t* < k because 
tan _1 (.) < |, with equality holding only when k — > oo. 

For < e < 1, we can prove an exponential bound for Pr [d gm ,c > (1 + e)dj. First of all, 
note that we do not have to choose the optimum t = t±. By the Taylor expansion, for small e, t\ can 
be well approximated by 



4ke 1 4/ce 

7T Z 2 7T^ 



Therefore, taking i = i|* = the tail bound becomes 

(\ COS^l* f — 1 

7 cos fc (1 + 6)* 



2fc 

(f ) 



1 



< 



COs(|)(l+6) 4 ^ 2 ; 

: 1 V 

CO S (|)(l+6) 4 ^ 2 J 

exp ( -k (log (cos ( ^ ) J + -| log(l + e) 



7T / / 7T^ 



f 2 



<exp( ~ fe 8(1 + e) )> 0<e< 1 (107) 
The last step in dl07t needs some explanations. First, by the Taylor expansion, 

log ^cos ^\ J + ^ log(l + e) 

2e 2 4 e 4 \ 4e / 1 2 
2~ q— + - +— e " o e +" 

7T Z 3 7P 1 / 7T Z \ 2 

2e 2 

=— l-e + ... (108) 

Therefore, we can seek the smallest constant 71 so that 

/ /2e\\ 4e e 2 e 2 

log cos - + — log(l + e) > — — — = _(l_ e+ .„) (109) 
V V 71 "// ^ 7i(! + e) 7i 

It is easy to see that as e — > 0, 71 — > 4^. Figure H a ) illustrates that it suffices to let 71 = 8, 
which can be numerically verified. This is why the last step in dl07t holds. Of course, we can get a 
better constant if (e.g.,) e = 0.5. 
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Now we need to show the other tail bound Pr (d gm ,c < (1 — e)dj '■ 

Pr (d gm>c < (1 - e)d) = Pr fcos (^) * J] I*//* < (1 - e)d\ 



A: 



=Pr [ exp [ £log (l^r*/*) ] > exp ( -ilog ( ) ) ] , < t < k 

vi =1 



: i - o \ * i 



^ I ^TTVY — w^V (Chernoff bound) (110) 



COS \2k) 1 COS 



\2k) 

which is minimized at t = 



t * = ^tan- 1 ((-log(l - e) + k log cos (^)) ^ , (111) 



provided > otherwise t* 2 may be less than 0. 
Again, t% can be replaced by its approximation 



Akf 

t*n t ? = —, (112) 

7T 

2 

provided > otherwise the probability upper bound may exceed one. Therefore, 



>r (4m, c < (1 - e)d) < 



(i-o y __ 

COS>C (jk) J COS k 



: exp I -A; I log I cos — J - ^| log(l - e) + ^ log (cos ^- 



We can bound ^ log (cos ^r) by restricting k. 

In order to attain Pr (d gm ^ c < (1 — e)dj < exp (—k ^(l+e) ))' we nave to restr i ct ^ to ^ e 

2 

larger than a certain value. For no particular reason, we like to express the restriction as k > — e , 

2 

for some constant 72. We find k > j-^- e suffices, although readers can verify that a slightly better 
(smaller) restriction would be k > 



1 1 - 2 



4/tt 2 -1/4 i ~~ 1.5326e a 



If > fk' then log (cos |f) > § log (cos ^). Therefore, 

Pr (dgm,c < (1 - e)^) < exp f — A; ^log ^cos ^ - ^ log(l - e) + ^ log (cos 7^))) 



e 2 \ , . vr 2 



<exp -/c— , k> (113) 

_ 1 8(1 + e / ' ~ 1.5e 



This completes the proof of Lemma|4] 
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Figure 7: (a) 



(a) 
'Vit+-> 



(b) 



as a function of e. (b): 



■ 2 /(1+e) 



log(cos(f)) + i|log(l+ e ) W- l og ( cos ggpg log(l-e)+§ log(cos ^) 

as a function of e. Graphically, we know that it suffices to use a constant 8 in (I107t and 
(I113t . The optimal constant will be different for different e. For example, if e = 0.2, we 
could replace the constant 8 by a constant 5. 

Appendix E. Proof of Lemma |6| 

Assume x ~ C(0, d). The log likelihood (/(x; d)) and first three derivatives are 



l(x; d) = log(d) - log(vr) - log(x 2 + d 2 

l"(d) - 



d x 2 + d 2 

1 2x 2 - 2d 2 



d 2 (x 2 + d 2 ) 2 



d 3 



+ 



M 



+ 



8d(x 2 - d 2 ) 



(114) 
(115) 

(116) 
(117) 



(x 2 + d 2 ) 2 (x 2 + d 2 ) 3 

The MLE cImle is asymptotically normal with mean d and variance ^jpy, where 1(d), the 
expected Fisher Information, is 

2_ / x 2 -d 2 

d? +2E \{x 2 + d 2 ) 2 



I = 1(d) 



-l"(d)) 



1 

2d 2 "' 



(118) 



because 



d 2 



[x 2 + d 2 ) 2 



oo x 2 _ d 2 



-dx 



, (x 2 + d 2 ) 3 
d W 2 d 2 (tan 2 (t) - 1) 



d 



7T J_ 


tt/2 


1 


/•t/2 


d% 


J -tt/2 


1 




dV 


(I" 



d 6 /cos 6 (t) cos 2 (i) 
cos 2 (i) - 2cos 4 (t)dt 



df 



2^ 



1 

id 2 



(119) 
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Therefore, we obtain 



Var(,/ WL/J ) =^- + ()[^). (120) 



General formulas for the bias and higher moments of the MLE ar e available in JBartlettL Il953 



Shenton and Bowman, 1963). We need to evaluate the expressions in (Shenton and Bowman, 1963 



16a-16d), involving tedious algebra: 

E { 1 ^)= d -'§! +0 {h) <121) 

E Umle ~ E (d MLE ) ) 3 = [1 ] ~5 12] + O f (123) 



fc 2 I 2 u 3 . 

E ( dMLE ~ E ( dj\,jLE ) ) = T^T9 + "TT I ~~ T9 + ' ' ' ' " 



/fc 2 I 2 fc 3 \ I 2 I 4 

^^jtaaHiM)^^, <124) 

where, after re-formatting, 

[12] = E(Z') 3 + E(/T ), [l 4 ] = E(Z') 4 , [1 2 2] = E(r'(/') 2 ) + E(/') 4 , 

[13] = E(/') 4 + 3E(Z"(0 2 ) + E(Z'Z'"), [l 3 ] = E(/') 3 . (125) 

We will neglect most of the algebra. To help readers verifying the results, the following formula 
we derive may be useful: 

1 Y m 1 x 3 x 5 x ... x (2m- 1) 1 

' - m = 1,2,3,... (126) 



x 2 + d 2 ) 2 x 4 x 6 x ... x (2m) d 2m ' 

Without giving the detail, we report 

E (o 3 = o, E (ir)=-~, e(o 4 31 



2d 3 ' v ' 8d^ 

E(^(0 2 ) = -^, E ( /r ') = ^- (127) 

Hence 



E ( dj^LE ) = d + y + O ( ) (129) 



Thus, we obtain 

k \k 2 

f 7d^ 

k k 2 \k 3 



^■(<Iull)=^+ 7 S + 0[- 1 ) (130) 



E (d M LE ~ E (d M LE)) 3 = + O (j^J (131) 

E (d M LE -E (dMLE)) =^r + ^ + °(^)- (132) 
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Because duLE has O (r) bias, we recommend the bias-corrected estimator 



"'j / LL'.< - <I\ILL ( 1 ~ £ ) • ( 1 33) 



whose first four moments are 



E(d M LE,c) =d + o(Jp) (134) 
,r ft \ 2d 2 3d 2 

Var (d M LE,c) = — + ^ + [p) ( 135 ) 

E (d MLE , c ~ E (d M LE,c^ = ^jrf- + O f-^\ (136) 

4 12d 4 186d 4 „ / 1 



E ( dj[,fLE,c ~ E ( dj[,fLE,c ) ) = -p- + fc3 +Q|^|, (137) 



by brute-force algebra. First, it is obvious that 



2d 2 8d 2 ( 1 



Then 

2 



Var ( dMLE,c ) = E ( duLE,c ~ E(dMLE, 



1\ . ./l^ 2 



^"- i )( 1 -s)-s +0 (iP 

= E(, i „ 1£ - (i ) 2 (l-^ + ^-2^( 1 -i) + 0(^) 
2d 2 3d 2 „ / IN 

= -+i^ +0 (f> <139) 

We can evaluate the higher central moments of <1mle,c similarly, but we skip the algebra. 
Therefore, we have completed the proof for Lemma|6] 
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